Abstract

In recent years many sparse linear discriminant analysis methods have been proposed for
high-dimensional classification and variable selection. However, most of these proposals focus
on binary classification and they are not directly applicable to multiclass classification problems.
There are two sparse discriminant analysis methods that can handle multiclass classification
problems, but their theoretical justifications remain unknown. In this paper, we propose
a new multiclass sparse discriminant analysis method that estimates all discriminant directions
simultaneously. We show that when applied to the binary case our proposal yields a classification
direction that is equivalent to those by two successful binary sparse LDA methods
in the literature. An efficient algorithm is developed for computing our method with
high-dimensional data. Variable selection consistency and rates of convergence are established
under the ultrahigh dimensionality setting. We further demonstrate the superior performance of
our proposal over the existing methods on simulated and real data.

Keywords: Discriminant analysis; High dimensional data; Variable selection; Multiclass
classification; Rates of convergence.


1 Introduction 

In multiclass classification we have a pair of random variables $(Y, X)$, where $X \in \mathbb{R}^p$ and
$Y \in \{1, \ldots, K\}$. We need to predict $Y$ based on $X$. Define $\pi_k = \Pr(Y = k)$. The linear discriminant
analysis model states that

$$X \mid (Y = k) \sim N(\mu_k, \Sigma), \quad k \in \{1, 2, \ldots, K\}. \qquad (1)$$


Under (1), the Bayes rule can be explicitly derived as follows:

$$\hat Y = \arg\max_k \{(X - \mu_k/2)^T \beta_k + \log \pi_k\}, \qquad (2)$$

where $\beta_k = \Sigma^{-1}\mu_k$ for $k = 1, \ldots, K$. Linear discriminant analysis has been observed to perform
very well on many low-dimensional datasets (Michie et al. 1994, Hand 2006). However, it may
not be suitable for high-dimensional datasets for at least two reasons. First, it is obvious that linear
discriminant analysis cannot be applied if the dimension $p$ exceeds the sample size $n$, because
the sample covariance matrix will be singular. Second, Fan & Fan (2008) showed that even if the
true covariance matrix is an identity matrix and we know this fact, a classifier involving all the
predictors will be no better than random guessing.

In recent years, many high-dimensional generalizations of linear discriminant analysis have
been proposed (Tibshirani et al. 2002, Trendafilov & Jolliffe 2007, Clemmensen et al. 2011,
Fan & Fan 2008, Wu et al. 2008, Shao et al. 2011, Cai & Liu 2011, Witten & Tibshirani 2011,
Mai et al. 2012, Fan et al. 2012). In the binary case, the discriminant direction is
$\beta = \Sigma^{-1}(\mu_2 - \mu_1)$. One can seek sparse estimates of $\beta$ to generalize linear discriminant
analysis to deal with high-dimensional classification. Indeed, this is the common feature of three
popular sparse discriminant analysis methods: the linear programming discriminant (Cai & Liu 2011),
the regularized optimal affine discriminant (Fan et al. 2012) and the direct sparse discriminant
analysis (Mai et al. 2012). The linear programming discriminant finds a sparse estimate by the
Dantzig selector (Candes & Tao 2007); the regularized optimal affine discriminant (Fan et al. 2012)
adds the lasso penalty (Tibshirani 1996) to Fisher's discriminant analysis; and the direct sparse
discriminant analysis (Mai et al. 2012) derives the sparse discriminant direction via a sparse
penalized least squares formulation. The three methods can detect the important predictors and
consistently estimate the classification rule with overwhelming probability in the presence of
ultrahigh dimensions. However, they are explicitly designed for binary classification and do not
handle the multiclass case naturally.


Two popular multiclass sparse discriminant analysis proposals are the $\ell_1$ penalized Fisher's
discriminant (Witten & Tibshirani 2011) and sparse optimal scoring (Clemmensen et al. 2011).
However, these two methods do not have theoretical justifications. It is generally unknown whether
they can select the true variables with high probability, how close their estimated discriminant
directions are to the true directions, and whether the final classifier will perform similarly to the
Bayes rule.

Therefore, it is desirable to have a new multiclass sparse discriminant analysis algorithm that
is conceptually intuitive, computationally efficient and theoretically sound. To this end, we propose
a new sparse discriminant method for high-dimensional multiclass problems. We show that
our proposal not only has competitive empirical performance but also enjoys strong theoretical
properties under ultrahigh dimensionality. In Section 2 we introduce the details of our proposal
after briefly reviewing the existing two proposals. We also develop an efficient algorithm for our
method. Theoretical results are given in Section 3. In Section 4 we use simulations and a real
data example to demonstrate the superior performance of our method over sparse optimal scoring
(Clemmensen et al. 2011) and $\ell_1$ penalized Fisher's discriminant (Witten & Tibshirani 2011).
Technical proofs are in an Appendix.


2 Method 


2.1 Existing proposals 

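The Bayes rule under a linear discriminant analysis model is

$$\hat Y = \arg\max_k \{(X - \mu_k/2)^T \beta_k + \log \pi_k\},$$

where $\beta_k = \Sigma^{-1}\mu_k$ for $k = 1, \ldots, K$. Let $\theta_k^{Bayes} = \beta_k - \beta_1$ for $k = 1, \ldots, K$. Then the Bayes
rule can be written as

$$\hat Y = \arg\max_k \{(\theta_k^{Bayes})^T (X - (\mu_k + \mu_1)/2) + \log \pi_k\}. \qquad (3)$$

We refer to the directions $\theta^{Bayes} = (\theta_2^{Bayes}, \ldots, \theta_K^{Bayes}) \in \mathbb{R}^{p \times (K-1)}$ as the discriminant directions.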
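As an illustration of why the two forms of the Bayes rule agree, the following minimal sketch evaluates both on known parameters. The function names, the 0-based class labels and the array layout (one row per class) are assumptions made for this example only; the centering $(\mu_k + \mu_1)/2$ is the one that makes the two scores differ by a term that does not depend on $k$.

```python
import numpy as np

def bayes_rule_beta(x, mus, Sigma, pis):
    """Bayes rule in the beta form: argmax_k (x - mu_k/2)^T beta_k + log pi_k."""
    betas = np.linalg.solve(Sigma, mus.T).T              # row k is beta_k = Sigma^{-1} mu_k
    scores = np.einsum("kp,kp->k", x - mus / 2.0, betas) + np.log(pis)
    return int(np.argmax(scores))

def bayes_rule_theta(x, mus, Sigma, pis):
    """Same rule written with theta_k = beta_k - beta_1 (row 0 plays the role of class 1)."""
    thetas = np.linalg.solve(Sigma, (mus - mus[0]).T).T  # row k is Sigma^{-1}(mu_k - mu_1)
    mids = (mus + mus[0]) / 2.0                          # (mu_k + mu_1) / 2
    scores = np.einsum("kp,kp->k", thetas, x - mids) + np.log(pis)
    return int(np.argmax(scores))
```

The two score vectors differ only by a quantity common to all classes, so the predicted label is the same whichever form is used.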

We briefly review two existing multiclass sparse discriminant methods: the sparse optimal
scoring (Clemmensen et al. 2011) and the $\ell_1$ penalized Fisher's discriminant (Witten & Tibshirani
2011). Instead of estimating $\theta^{Bayes}$ directly, these two methods estimate a set of directions
$\eta = (\eta_1, \ldots, \eta_{K-1}) \in \mathbb{R}^{p \times (K-1)}$ such that $\eta$ spans the same linear subspace as $\theta^{Bayes}$ and hence linear
discriminant analysis on $X^T\eta$ will be equivalent to (3) on the population level. More specifically,
these two methods look for estimates of $\eta = (\eta_1, \ldots, \eta_{K-1})$ in Fisher's discriminant analysis:

$$\eta_k = \arg\max_{\eta_k} \eta_k^T \Sigma_b \eta_k, \quad \text{s.t. } \eta_k^T \Sigma \eta_k = 1, \ \eta_k^T \Sigma \eta_l = 0 \text{ for } l < k, \qquad (4)$$

where $\Sigma_b = \frac{1}{K-1} \sum_{k=1}^K (\mu_k - \bar\mu)(\mu_k - \bar\mu)^T$ with $\bar\mu = \frac{1}{K} \sum_k \mu_k$.

With a little abuse of terminology, we refer to $\eta$ as discriminant directions as well. To find $\eta$,
define $Y^{dm}$ as an $n \times K$ matrix of dummy variables with $Y^{dm}_{ik} = 1(Y^i = k)$.

In addition to the discriminant direction $\eta_k$, sparse optimal scoring creates $K - 1$ vectors of
scores $\alpha_1, \ldots, \alpha_{K-1} \in \mathbb{R}^K$. Then for $k = 1, \ldots, K - 1$, sparse optimal scoring estimates $\eta_k$
sequentially. In each step, sparse optimal scoring finds $\hat\alpha_k, \hat\eta_k^{SOS}$. Suppose the first $k - 1$ score
vectors $\hat\alpha_l, l < k$ and discriminant directions $\hat\eta_l^{SOS}, l < k$ are available. Then sparse optimal
scoring finds $\hat\alpha_k, \hat\eta_k^{SOS}$ by solving the following problem:

$$(\hat\alpha_k, \hat\eta_k^{SOS}) = \arg\min_{\alpha_k, \eta_k} \sum_{i=1}^n (Y^{dm}_{i\cdot}\alpha_k - \tilde X_{i\cdot}\eta_k)^2 + \lambda\|\eta_k\|_1 \qquad (5)$$

$$\text{s.t. } \frac{1}{n}\alpha_k^T (Y^{dm})^T Y^{dm} \alpha_k = 1, \quad \alpha_k^T (Y^{dm})^T Y^{dm} \hat\alpha_l = 0 \text{ for any } l < k,$$

where $Y^{dm}_{i\cdot}$ and $\tilde X_{i\cdot}$ denote the $i$th rows of $Y^{dm}$ and $\tilde X$, $\tilde X$ is the centered data matrix, and
$\lambda$ is a tuning parameter. The sparse optimal scoring is
closely related to (4), because when the dimension is low, the unpenalized version of (5) gives the
same directions (up to a scalar) as (4) with the parameters $\Sigma_b$ and $\Sigma$ substituted with the sample
estimates. Therefore, with the $\ell_1$ penalty, sparse optimal scoring gives sparse approximations to $\eta$.

Note that the constraint $\alpha_k^T (Y^{dm})^T Y^{dm} \hat\alpha_l = 0$, $l < k$ indicates that $(\hat\alpha_k, \hat\eta_k^{SOS})$ depends on
the knowledge of $(\hat\alpha_l, \hat\eta_l^{SOS})$, $l < k$. This is why we say that the sparse optimal scoring adopts a
sequential approach to estimate the discriminant directions.

The $\ell_1$ penalized Fisher's discriminant analysis estimates $\eta_k$ by

$$\hat\eta_k = \arg\max_{\eta_k} \eta_k^T \hat\Sigma_b^k \eta_k - \lambda_k \sum_j |\hat\sigma_j \eta_{kj}| \quad \text{s.t. } \eta_k^T \tilde\Sigma \eta_k \le 1,$$

for $k = 1, \ldots, K - 1$, where $\lambda_k$ are tuning parameters, $\hat\sigma_j^2$ is the $(j, j)$th element of the sample
estimate of $\Sigma$, $\tilde\Sigma$ is a positive definite estimate of $\Sigma$,




( 6 ) 


and $\Omega_k$ is the identity matrix if $k = 1$ and otherwise an orthogonal projection matrix with column
space orthogonal to $((Y^{dm})^T Y^{dm})^{-1/2} (Y^{dm})^T X \hat\eta_l$ for all $l < k$. Again, if the dimension is low, then
the unpenalized version of (6) is equivalent to (4) with the parameters replaced by the sample estimates.
Since $\Omega_k$ relies on $\hat\eta_l$ for all $l < k$, the $\ell_1$ penalized Fisher's discriminant analysis also finds the
discriminant directions sequentially.


2.2 Our proposal 

Good empirical results have been reported for the $\ell_1$ penalized Fisher's discriminant
analysis and the sparse optimal scoring. However, it is unknown whether either of these two
classifiers is consistent when more than two classes are present. Moreover, both sparse optimal scoring
and $\ell_1$ penalized Fisher's discriminant analysis estimate the discriminant directions sequentially.
We believe a better multiclass sparse discriminant analysis algorithm should be able to estimate
all discriminant directions simultaneously, just like the classical linear discriminant analysis. We
aim to develop a new computationally efficient multiclass sparse discriminant analysis method that
enjoys strong theoretical properties under ultrahigh dimensionality. Such a method can be viewed
as a natural multiclass counterpart of the three binary sparse discriminant methods in
Mai et al. (2012), Cai & Liu (2011) and Fan et al. (2012).


To motivate our method, we first discuss the implication of sparsity in the multiclass problem.
Note that, by (3), the contribution from the $j$th variable ($X_j$) will vanish if and only if

$$\theta_{2j}^{Bayes} = \cdots = \theta_{Kj}^{Bayes} = 0. \qquad (7)$$

Let $\mathcal{D} = \{j : \text{condition (7) does not hold}\}$. Note that whether an index $j$ belongs to $\mathcal{D}$ depends on
$\theta_{kj}$ for all $k$. This is because $\theta_{kj}$, $k = 2, \ldots, K$ are related to each other, as they are coefficients
for the same predictor. In other words, $\theta_{kj}$, $k = 2, \ldots, K$ are naturally grouped according to
$j$. Then the sparsity assumption states that $|\mathcal{D}| \ll p$, which is referred to as the common sparsity
structure.

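Our proposal begins with a convex optimization formulation of the Bayes rule of the multiclass
linear discriminant analysis model. Recall that $\theta_k^{Bayes} = \Sigma^{-1}(\mu_k - \mu_1)$ for $k = 2, \ldots, K$. On the
population level, we have

$$(\theta_2^{Bayes}, \ldots, \theta_K^{Bayes}) = \arg\min_{\theta_2, \ldots, \theta_K} \sum_{k=2}^K \{\tfrac{1}{2}\theta_k^T \Sigma \theta_k - (\mu_k - \mu_1)^T \theta_k\}. \qquad (8)$$

In the classical low-dimension-large-sample-size setting, we can estimate $(\theta_2^{Bayes}, \ldots, \theta_K^{Bayes})$ via
an empirical version of (8):

$$(\hat\theta_2, \ldots, \hat\theta_K) = \arg\min_{\theta_2, \ldots, \theta_K} \sum_{k=2}^K \{\tfrac{1}{2}\theta_k^T \hat\Sigma \theta_k - (\hat\mu_k - \hat\mu_1)^T \theta_k\}, \qquad (9)$$

where $\hat\Sigma = \frac{1}{n-K} \sum_{k=1}^K \sum_{Y^i = k} (X^i - \hat\mu_k)(X^i - \hat\mu_k)^T$, $\hat\mu_k = \frac{1}{n_k} \sum_{Y^i = k} X^i$ and $n_k$ is the sample
size within Class $k$. The solution to (9) gives us the classical multiclass linear discriminant
classifier.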
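The empirical problem (9) is an unpenalized quadratic, so its minimizer solves $\hat\Sigma \theta_k = \hat\mu_k - \hat\mu_1$ for each $k$. The sketch below computes the pooled covariance, the class means and these directions with NumPy. The function name `lda_directions` and the 0-based labels are assumptions for illustration, and the code presumes $n - K > p$ so that $\hat\Sigma$ is invertible.

```python
import numpy as np

def lda_directions(X, y):
    """Classical (low-dimensional) estimates behind (9).

    X: (n, p) data matrix, y: integer class labels.
    Returns (Sigma_hat, mu_hat, theta_hat), where row k-1 of theta_hat is
    Sigma_hat^{-1} (mu_hat_k - mu_hat_1) for the k-th class after the first.
    """
    classes = np.unique(y)
    n, p = X.shape
    K = len(classes)
    mu_hat = np.vstack([X[y == k].mean(axis=0) for k in classes])   # (K, p)
    resid = X - mu_hat[np.searchsorted(classes, y)]                  # center within class
    Sigma_hat = resid.T @ resid / (n - K)                            # pooled covariance
    delta_hat = mu_hat[1:] - mu_hat[0]                               # (K-1, p)
    theta_hat = np.linalg.solve(Sigma_hat, delta_hat.T).T            # stationarity of (9)
    return Sigma_hat, mu_hat, theta_hat
```

When $p$ exceeds $n - K$ the linear solve above breaks down, which is the situation the penalized formulation is designed for.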

For presentation purposes, write $\theta_{\cdot j} = (\theta_{2j}, \ldots, \theta_{Kj})^T$ and define $\|\theta_{\cdot j}\| = (\sum_{k=2}^K \theta_{kj}^2)^{1/2}$. In
the high-dimensional case, we propose the following penalized formulation for multiclass sparse
discriminant analysis:

$$(\hat\theta_2, \ldots, \hat\theta_K) = \arg\min_{\theta_2, \ldots, \theta_K} \sum_{k=2}^K \{\tfrac{1}{2}\theta_k^T \hat\Sigma \theta_k - (\hat\mu_k - \hat\mu_1)^T \theta_k\} + \lambda \sum_{j=1}^p \|\theta_{\cdot j}\|, \qquad (10)$$

where $\lambda$ is a tuning parameter. It is clear that (10) is based on (9). In (10) we have used the group
lasso (Yuan & Lin 2006) to encourage the common sparsity structure. Let $\hat{\mathcal{D}} = \{j : \hat\theta_{\cdot j} \ne 0\}$,
which denotes the set of selected variables for the multiclass classification problem. We will show
later that with a high probability $\hat{\mathcal{D}}$ equals $\mathcal{D}$. One can also use a group version of a nonconvex
penalty (Fan & Li 2001) or an adaptive group lasso penalty (Bach 2008) to replace the group lasso
penalty in (10). To fix the main idea, we do not pursue this direction here.

After obtaining $\hat\theta_k$, $k = 2, \ldots, K$, we fit the classical multiclass linear discriminant analysis on
$(X^T\hat\theta_2, \ldots, X^T\hat\theta_K)$, as in sparse optimal scoring and $\ell_1$ penalized Fisher's discriminant analysis.
We repeat the procedure for a sequence of $\lambda$ values and pick the one with the smallest
cross-validation error rate.

We would like to remark that our proposal is derived from a different angle than
sparse optimal scoring and $\ell_1$ penalized Fisher's discriminant analysis. Both sparse optimal scoring
and $\ell_1$ penalized Fisher's discriminant analysis penalize a formulation related to Fisher's discriminant
analysis in (4), while our method directly estimates the Bayes rule. This different angle leads
to considerable convenience in both computation and theoretical studies. Yet we can easily recover
the directions defined by Fisher's discriminant analysis after applying our method. See Section A.1
for details.

2.3 Connections with existing binary sparse LDA methods

Although our proposal is primarily motivated by the multiclass classification problem, it can be
directly applied to the binary classification problem as well by simply letting $K = 2$ in the
formulation (10). It turns out that the binary special case of our proposal has very intimate connections
with some proven successful binary sparse LDA methods in the literature. We elaborate more on
this point in what follows.

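When $K = 2$, (10) reduces to

$$\hat\theta^{MSDA}(\lambda) = \arg\min_{\theta} \tfrac{1}{2}\theta^T \hat\Sigma \theta - (\hat\mu_2 - \hat\mu_1)^T \theta + \lambda\|\theta\|_1. \qquad (11)$$

Considering the Dantzig selector formulation of (11), we have the following constrained
minimization estimator defined as

$$\hat\theta = \arg\min_{\theta} \|\theta\|_1 \quad \text{s.t. } \|\hat\Sigma\theta - (\hat\mu_2 - \hat\mu_1)\|_\infty \le \lambda. \qquad (12)$$

The above estimator is exactly the linear programming discriminant (LPD) (Cai & Liu 2011).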
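A short note on why (12) is a natural companion of (11), written from the standard subgradient optimality condition of an $\ell_1$-penalized quadratic. The vector $s$ is notation introduced only for this note. The condition shows that any minimizer of (11) is feasible for (12) at the same $\lambda$; it does not by itself say that the two estimators coincide.

```latex
\[
\hat\Sigma\hat\theta - (\hat\mu_2 - \hat\mu_1) + \lambda s = 0,
\qquad s \in \partial\|\hat\theta\|_1,\ \ \|s\|_\infty \le 1
\quad\Longrightarrow\quad
\|\hat\Sigma\hat\theta - (\hat\mu_2 - \hat\mu_1)\|_\infty \le \lambda .
\]
```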
Moreover, we compare (11) with another two well-known sparse discriminant analysis proposals
for binary classification: the regularized optimal affine discriminant (ROAD) (Fan et al. 2012)
and the direct sparse discriminant analysis (DSDA) (Mai et al. 2012). Denote the estimates of the
discriminant directions given by ROAD and DSDA as $\hat\theta^{ROAD}$ and $\hat\theta^{DSDA}$, respectively. Then we
have

$$\hat\theta^{ROAD}(\lambda) = \arg\min_{\theta} \theta^T \hat\Sigma \theta + \lambda\|\theta\|_1 \quad \text{s.t. } \theta^T(\hat\mu_2 - \hat\mu_1) = 1, \qquad (13)$$

$$\hat\theta^{DSDA}(\lambda) = \arg\min_{\theta_0, \theta} \sum_{i} (Y^i - \theta_0 - (X^i)^T\theta)^2 + \lambda\|\theta\|_1. \qquad (14)$$


We derive the following proposition to reveal the connections between our proposal ($K = 2$)
and ROAD, DSDA. Note that the proofs of this proposition and all the subsequent lemmas and
theorems can be found in the appendix.

Proposition 1. Define $c_0(\lambda) = \hat\theta^{MSDA}(\lambda)^T(\hat\mu_2 - \hat\mu_1)$, $c_1(\lambda) = \hat\theta^{DSDA}(\lambda)^T(\hat\mu_2 - \hat\mu_1)$ and
$a = n|c_1(\lambda)| / |c_0(\lambda)|$. Then we have

$$\hat\theta^{MSDA}(\lambda) = c_0(\lambda)\,\hat\theta^{ROAD}(\lambda/|c_0(\lambda)|), \qquad (15)$$

$$\hat\theta^{MSDA}(\lambda) = \frac{c_0(\lambda)}{c_1(a\lambda)}\,\hat\theta^{DSDA}(a\lambda). \qquad (16)$$

Proposition 1 shows that the classification direction by our proposal is identical to a classification
direction by ROAD and a classification direction by DSDA. Consequently, our proposal
($K = 2$) has the same solution path as ROAD and DSDA.

2.4 Algorithm 

Besides their solid theoretical foundation, LPD, ROAD and DSDA all enjoy computational
efficiency. In particular, DSDA's computational complexity is the same as fitting a lasso linear
regression model. In this section we show that our proposal for the multiclass problem can be solved by
a very efficient algorithm. In light of this and Proposition 1, our proposal is regarded as the natural
multiclass generalization of these successful binary sparse LDA methods.

We now present the efficient algorithm for solving (10). For convenience write $\hat\delta^k = \hat\mu_k - \hat\mu_1$.
Our algorithm is based on the following lemma.

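Lemma 1. Given $\{\theta_{\cdot j'}, j' \ne j\}$, the solution of $\theta_{\cdot j}$ to (10) is defined as

$$\arg\min_{\theta_{\cdot j}} \sum_{k=2}^K \tfrac{1}{2}(\theta_{kj} - \tilde\theta_{kj})^2 + \frac{\lambda}{\hat\sigma_{jj}}\|\theta_{\cdot j}\| \qquad (17)$$

where $\tilde\theta_{kj} = (\hat\delta^k_j - \sum_{l \ne j} \hat\sigma_{lj}\theta_{kl})/\hat\sigma_{jj}$, $\tilde\theta_{\cdot j} = (\tilde\theta_{2j}, \ldots, \tilde\theta_{Kj})^T$ and
$\|\tilde\theta_{\cdot j}\| = (\sum_{k=2}^K \tilde\theta_{kj}^2)^{1/2}$. The solution to (17) is given by

$$\hat\theta_{\cdot j} = \tilde\theta_{\cdot j}\left(1 - \frac{\lambda}{\hat\sigma_{jj}\|\tilde\theta_{\cdot j}\|}\right)_+. \qquad (18)$$

Based on Lemma 1 we use the following blockwise-descent algorithm to implement our
multiclass sparse discriminant analysis.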


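Algorithm 1 (Multiclass sparse discriminant analysis for a given penalization parameter).

1. Compute $\hat\Sigma$ and $\hat\delta^k$, $k = 1, 2, \ldots, K$;

2. Initialize $\hat\theta_k^{(0)}$ and compute $\tilde\theta_k^{(0)}$ accordingly;

3. For $m = 1, 2, \ldots$, do the following loop until convergence: for $j = 1, \ldots, p$,

   (a) compute
       $$\hat\theta_{\cdot j}^{(m)} = \tilde\theta_{\cdot j}^{(m-1)}\left(1 - \frac{\lambda}{\hat\sigma_{jj}\|\tilde\theta_{\cdot j}^{(m-1)}\|}\right)_+;$$

   (b) update
       $$\tilde\theta_{kj} = \frac{\hat\delta^k_j - \sum_{l \ne j}\hat\sigma_{lj}\hat\theta_{kl}^{(m)}}{\hat\sigma_{jj}}.$$

4. Let $\hat\theta_k$ be the solution at convergence. The output classifier is the usual linear discriminant
   classifier on $(X^T\hat\theta_2, \ldots, X^T\hat\theta_K)$.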
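To make the blockwise updates concrete, here is a minimal NumPy sketch of a blockwise-descent loop for the group-lasso objective (10). It is an illustration under stated assumptions and not the code of the authors' R package. The function name `msda_fit`, the zero initialization, the stopping rule and the array layout (row $k$ of `delta` holds $\hat\mu_k - \hat\mu_1$) are all choices made for this example, and the threshold $\lambda/\hat\sigma_{jj}$ is the one obtained by completing the square in the $j$th block of (10).

```python
import numpy as np

def msda_fit(Sigma, delta, lam, max_iter=1000, tol=1e-10):
    """Blockwise descent for objective (10) (illustrative sketch).

    Sigma: (p, p) symmetric covariance estimate with positive diagonal.
    delta: (K-1, p) array; row k holds mu_hat_k - mu_hat_1.
    lam:   group-lasso tuning parameter (>= 0).
    Returns theta of shape (K-1, p); column j is the group theta_{.j}.
    """
    n_dir, p = delta.shape
    theta = np.zeros((n_dir, p))                 # assumed starting value
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(p):
            # delta_j - sum_{l != j} sigma_lj * theta_kl, for every k at once
            partial = delta[:, j] - theta @ Sigma[:, j] + theta[:, j] * Sigma[j, j]
            tilde = partial / Sigma[j, j]
            norm = np.linalg.norm(tilde)
            # group soft-thresholding with threshold lam / sigma_jj
            shrink = max(0.0, 1.0 - lam / (Sigma[j, j] * norm)) if norm > 0 else 0.0
            new = tilde * shrink
            max_change = max(max_change, float(np.max(np.abs(new - theta[:, j]))))
            theta[:, j] = new
        if max_change < tol:                     # assumed convergence criterion
            break
    return theta
```

The selected set is then the set of columns with a nonzero norm, for example `np.flatnonzero(np.linalg.norm(theta, axis=0) > 0)`. With `lam=0` the loop reduces to Gauss-Seidel iterations for $\hat\Sigma\theta_k = \hat\delta^k$.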


We have implemented our method in an R package msda which is available on CRAN. Our
package also handles the version of (10) using an adaptive group lasso penalty, because both
Lemma 1 and Algorithm 1 can be easily generalized to handle the adaptive group lasso penalty.

3 Theory


In this section we study theoretical properties of our proposal under the setting where p can be 
much larger than n. Under regularity conditions we show that our method can consistently select 
the true subset of variables and at the same time consistently estimate the Bayes rule. 

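We begin with some useful notation. For a vector $\alpha$, $\|\alpha\|_\infty = \max_j |\alpha_j|$, $\|\alpha\|_1 = \sum_j |\alpha_j|$,
while, for a matrix $\Omega \in \mathbb{R}^{m \times n}$, $\|\Omega\|_\infty = \max_i \sum_j |\omega_{ij}|$, $\|\Omega\|_1 = \max_j \sum_i |\omega_{ij}|$. Define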

p = max{||Si,c^i,||oo, A = max{||/x||i, ||0||i}; 

^min min ^max maX 

l|Sx>C,X)Sp^2?lloo = V* ■ 


Let d be the cardinality of V. 

Define G as the subgradient of the group lasso penalty at the true 9x) and we 

assume the following condition: 


(CO) = k < 1. 


Condition (C0) is required to guarantee the selection consistency. A condition similar to condition
(C0) has been used to study the group lasso penalized regression model (Bach 2008).


We further let p, A,p*, nhe fixed and assume the following regularity conditions: 


Cl C*i 

(Cl) There exists ci, Ci such that — < vr^ < — for /c = 1, 

K K 

log (pd) 

(C2) n,p,^oo and-^ 0; 

n 

(C3) » {U2iM,i/x 

n 


,i^and§^<Ci 

”min 













(C4) mirifc — 0 ^/)}^^'^ is bounded away from 0. 


Condition (C1) guarantees that we will have a decent sample size for each class. Condition (C2)
requires that $p$ cannot grow too fast with respect to $n$. This condition is very mild, because it can
allow $p$ to grow at a nonpolynomial rate of $n$. In particular, if d = then condition (C2)
is satisfied if logp = o(n^"). Condition (C3) guarantees that the nonzero coefficients are bounded
away from 0, which is a common assumption in the literature. The lower bound of $\theta_{\min}$ tends to
0 under condition (C3). Condition (C4) is required such that all the classes can be separated from
each other. If condition (C4) is violated, even the Bayes rule cannot work well.

In the following theorems, we let $C$ denote a generic positive constant that can vary from place
to place.


Theorem 1. 1. Under conditions (CO)-(Cl), there exists a generic constant M such that, if 

9 

X < mini ^212^ M(1 — k)}, then with a probability greater than 

8(p 

fi 

1 - CpdeM-Cn-—) - C'iCexp(-C'—) - Cp{K - 1) exp(-C'n-) (19) 
Kd^ K 


we have thatV = V, and \\0k — ^ ApXfor k = 2,..., K. 

2. If we further assume conditions (C2)-(C3), we have that if { - 

n 

then with probability tending to 1, we have V = T>, and ||0fc — < ApXfor k = 

2,...,iC. 


Next, we show that our proposal is a consistent estimator of the Bayes rule in terms of the
misclassification error rate. Define

$$R_n = \Pr(\hat Y(\hat\theta_k, \hat\pi_k, k = 1, \ldots, K) \ne Y \mid \text{observed data}),$$

where $\hat Y(\hat\theta_k, \hat\pi_k, k = 1, \ldots, K)$ is the prediction by our method. Also define $R$ as the Bayes error.
Then we have the following conclusions.


Theorem 2. 1. Under conditions (CO)-(Cl), there exists a generic constant Mi such that, if 

9 

X < min{-^^, Mi(l — n)}, then with a probability greater than 

oLp 


Tl 6 ^ 

1 - Cpdexp{-Cn-—) - CK exp(-C—) - Cp{K - 1) exp(-C'n—) (20) 

Kd^ K 


we have 

\Rn -R\< ( 21 ) 

for some generic constant Mi. 

2. Under conditions (C0)-(C4), ifX^O, then with probability tending to 1, we have 


Rfi —y R. 


Remark 2. Based on our proof we can further derive the asymptotic results by letting $K$ (the
number of classes) diverge with $n$ to infinity. This only requires more cumbersome notation and
bounds; the analysis remains essentially the same. To show a clearer picture of the theory, we
have focused on the fixed $K$ case.


4 Numerical Studies 

4.1 Simulations 

We demonstrate our proposal by simulation. For comparison, we include the sparse optimal scoring
and $\ell_1$ penalized Fisher's discriminant analysis in the simulation study. Six simulation models
are considered where the dimension $p = 800$ and the training set has a sample size $n = 75K$, where
$K$ is the number of classes in each model. We generate a validation set of size $n$ to select the tuning
parameters and a testing set of size 1000 for each method. Recall that $\beta_k = \Sigma^{-1}\mu_k$. We specify
$\beta_k$ and $\Sigma$ as in the following six models and then let $\mu_k = \Sigma\beta_k$. For simplicity, we say that a
matrix $\Sigma$ has the AR($\rho$) structure if $\sigma_{jk} = \rho^{|j-k|}$ for $j, k = 1, \ldots, p$; on the other hand, $\Sigma$ has the
CS($\rho$) structure if $\sigma_{jk} = \rho$ for any $j \ne k$ and $\sigma_{jj} = 1$ for $j = 1, \ldots, p$.

Model 1:

$K = 4$, $\beta_{jk} = 1.6$ for $j = 2k - 1, 2k$; $k = 1, \ldots, K$ and $\beta_{jk} = 0$ otherwise. The covariance
matrix $\Sigma$ has the AR(0.5) structure.

Model 2:

$K = 6$, $\beta_{jk} = 2.5$ for $j = 2k - 1, 2k$; $k = 1, \ldots, K$ and $\beta_{jk} = 0$ otherwise. The covariance
matrix $\Sigma = I_5 \otimes \Omega$, where $\Omega$ has the CS(0.5) structure.

Model 3:

$K = 4$, $\beta_{jk} = k + u_{jk}$ for $j = 1, \ldots, K$, where $u_{jk}$ follows the uniform distribution over the
interval $[-1/4, 1/4]$; $\beta_{jk} = 0$ otherwise. The covariance matrix $\Sigma$ has the CS(0.5) structure.

Model 4:

$K = 4$, $\beta_{jk} = k + u_{jk}$ for $j = 1, \ldots, 4$, where $u_{jk}$ follows the uniform distribution over the
interval $[-1/4, 1/4]$; $\beta_{jk} = 0$ otherwise. The covariance matrix $\Sigma$ has the CS(0.8) structure.

Model 5:

$K = 4$, $\beta_{2,1} = \cdots = \beta_{2,8} = 1.2$, $\beta_{3,1} = \cdots = \beta_{3,4} = -1.2$, $\beta_{3,5} = \cdots = \beta_{3,8} = 1.2$,
$\beta_{4,2j-1} = -1.2$, $\beta_{4,2j} = 1.2$ for $j = 1, \ldots, 4$; $\beta_{jk} = 0$ otherwise. The covariance matrix $\Sigma$ has the
AR(0.5) structure.

Model 6:

$K = 4$, $\beta_{2,1} = \cdots = \beta_{2,8} = 1.2$, $\beta_{3,1} = \cdots = \beta_{3,4} = -1.2$, $\beta_{3,5} = \cdots = \beta_{3,8} = 1.2$,
$\beta_{4,2j-1} = -1.2$, $\beta_{4,2j} = 1.2$ for $j = 1, \ldots, 4$; $\beta_{jk} = 0$ otherwise. The covariance matrix $\Sigma$ has the
AR(0.8) structure.
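For readers who want to reproduce a setting like Model 1, the following sketch builds the AR and CS covariance structures and draws Gaussian data with $\mu_k = \Sigma\beta_k$. The helper names, the equal class sizes and the default seed are assumptions of this example (the text specifies only the total training size), and classes are labeled from 0.

```python
import numpy as np

def ar_cov(p, rho):
    """AR(rho) structure: sigma_jk = rho^{|j-k|}."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def cs_cov(p, rho):
    """CS(rho) structure: sigma_jk = rho for j != k and sigma_jj = 1."""
    return np.full((p, p), rho) + (1.0 - rho) * np.eye(p)

def simulate_model1(n_per_class, p=800, K=4, signal=1.6, rho=0.5, seed=0):
    """Draw data in the spirit of Model 1 (equal class sizes are an assumption)."""
    rng = np.random.default_rng(seed)
    Sigma = ar_cov(p, rho)
    beta = np.zeros((K, p))
    for k in range(K):
        # the text's coordinates j = 2k-1, 2k (1-based) become columns 2k, 2k+1 (0-based)
        beta[k, 2 * k: 2 * k + 2] = signal
    mu = beta @ Sigma                      # row k is (Sigma beta_k)^T since Sigma is symmetric
    y = np.repeat(np.arange(K), n_per_class)
    L = np.linalg.cholesky(Sigma)
    X = mu[y] + rng.standard_normal((len(y), p)) @ L.T   # rows ~ N(mu_{y_i}, Sigma)
    return X, y, beta, Sigma
```

A call such as `simulate_model1(75)` gives a training set of the size used in the text for $K = 4$ under the equal-class-size assumption.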

The error rates of these methods are listed in Table 1. To compare variable selection
performance, we report the number of correctly selected variables (C) and the number of incorrectly
selected variables (IC) by each method. We want to highlight two observations from Table 1. First,
our method is the best across all six models. Second, our method is a very good approximation
of the Bayes rule in terms of both sparsity and misclassification error rate. Although our method
tends to select a few more variables besides the true ones, this can be improved by using the
adaptive group lasso penalty (Bach 2008). Because the other two methods do not use the adaptive lasso
penalty, we do not include the results of our method using the adaptive group lasso penalty for a
fair comparison.


4.2 A real data example 


We further demonstrate the application of our method on the IBD dataset (Burczynski et al. 2006).
This dataset contains 22283 gene expression levels from 127 people. These 127 people are either
normal people, people with Crohn's disease or people with ulcerative colitis. This dataset can be
downloaded from Gene Expression Omnibus with accession number GDS1615. We randomly split
the dataset with a 2:1 ratio in a balanced manner to form the training set and the testing set.


             Metric      Bayes         Our           Witten        Clemmensen
Model 1      Error(%)    11.0 (0.06)   12.4 (0.07)   15.5 (0.07)   13 (0.06)
             C           8             8 (0)         8 (0)         8 (0)
             IC          0             10 (0.6)      126 (4.9)     5 (0.4)
Model 2      Error(%)    13.3 (0.05)   15.2 (0.07)   31.7 (0.20)   17 (0.08)
             C           12            12 (0)        12 (0)        12 (0)
             IC          0             15 (0.7)      19.5 (1.5)    16 (0.3)
Model 3      Error(%)    8.8 (0.06)    9.4 (0.09)    14.1 (0.06)   12.7 (0.08)
             C           4             4 (0)         4 (0)         4 (0)
             IC          0             3 (0.4)       796 (0)       30 (0.2)
Model 4      Error(%)    5.3 (0.06)    5.7 (0.08)    7 (0.05)      7.6 (0.07)
             C           4             4 (0)         4 (0)         4 (0)
             IC          0             4 (0.5)       796 (0)       30 (2.2)
Model 5      Error(%)    8.3 (0.05)    9.5 (0.07)    17.9 (0.14)   13.6 (0.09)
             C           8             8 (0)         8 (0)         8 (0)
             IC          0             6 (0.9)       97 (2.8)      4 (0.5)
Model 6      Error(%)    14.2 (0.06)   17.4 (0.08)   23.4 (0.09)   24.8 (0.09)
             C           8             8 (0.0)       8 (0)         6 (0.1)
             IC          0             0 (0)         4 (0.5)       3 (0.3)

Table 1: Simulation results for Models 1-6. The two competing methods are denoted by the first
author of the original papers. In particular, Witten's method is the $\ell_1$ penalized Fisher's
discriminant analysis, and Clemmensen's method is the sparse optimal scoring method. The reported
numbers are medians based on 500 replicates. Standard errors are in parentheses. The quantity C
is the number of correctly selected variables, and IC is the number of incorrectly selected variables.

                     Our            Witten         Clemmensen
Error(%)             7.32 (0.972)   21.95 (1.10)   9.76 (0.622)
Fitted Model Size    25 (0.7)       127 (0)        27 (0.5)

Table 2: Classification and variable selection results on the real dataset. The two competing
methods are denoted by the first author of the original papers. In particular, Witten's method is the
$\ell_1$ penalized Fisher's discriminant analysis, and Clemmensen's method is the sparse optimal scoring
method. All numbers are medians based on 100 random splits. Standard errors are in parentheses.

It is known that the marginal t-test screening (Fan & Fan 2008) can greatly speed up the
computation for linear discriminant analysis in binary problems. For a multiclass problem the natural
generalization of t-test screening is the F-test screening. Compute the F-test statistic for each $X_j$
defined as

$$f_j = \frac{\sum_{g=1}^G n_g(\hat\mu_{gj} - \hat\mu_j)^2/(G-1)}{\sum_{i=1}^n (X^i_j - \hat\mu_{Y^i, j})^2/(n-G)},$$

where $G$ is the number of groups, $\hat\mu_{gj}$ is the sample mean of $X_j$ within group $g$, $\hat\mu_j$ is the sample
grand mean for $X_j$ and $n_g$ is the within-group sample size. Based on the
F-test statistic, we define the F-test screening by only keeping the predictors with F-test statistics
among the $d_n$ largest. As recommended by many researchers (Fan & Fan 2008, Fan & Song 2010,
Mai & Zou 2013), $d_n$ can be the same as the sample size, if we believe that the number of
truly important variables is much smaller than the sample size. Therefore, we let $d_n = 127$ for the
current dataset.
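The F-test screening step can be sketched as follows, using the usual one-way ANOVA F statistic computed feature by feature. The function name `f_screen` and its return convention are assumptions made for illustration.

```python
import numpy as np

def f_screen(X, y, d_n):
    """Keep the d_n predictors with the largest one-way ANOVA F statistics.

    X: (n, p) data matrix, y: class labels, d_n: number of predictors to keep.
    Returns (sorted indices of kept predictors, vector of F statistics).
    """
    classes = np.unique(y)
    n, p = X.shape
    G = len(classes)
    grand = X.mean(axis=0)                               # grand mean of each predictor
    between = np.zeros(p)
    within = np.zeros(p)
    for g in classes:
        Xg = X[y == g]
        mg = Xg.mean(axis=0)                             # within-group mean
        between += Xg.shape[0] * (mg - grand) ** 2       # n_g (mu_gj - mu_j)^2
        within += ((Xg - mg) ** 2).sum(axis=0)           # within-group sum of squares
    f = (between / (G - 1)) / (within / (n - G))
    keep = np.argsort(f)[::-1][:d_n]                     # indices of the d_n largest
    return np.sort(keep), f
```

Only the columns returned in `keep` would then be passed on to the sparse discriminant fit.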

We estimate the rules given by sparse optimal scoring, $\ell_1$ penalized Fisher's discriminant
analysis and our proposal on the training set. The tuning parameters are chosen by 5-fold
cross-validation. Then we evaluate the classification errors on the testing set. The results based on 100
replicates are listed in Table 2. It can be seen that our proposal achieves the highest accuracy with
the sparsest classification rule. This again shows that our method is a very competitive classifier.

5 Summary

In this paper we have proposed a new formulation to derive sparse multiclass discriminant
classifiers. We have shown that our proposal has a solid theoretical foundation and can be solved by
a very efficient computational algorithm. Our proposal actually gives a unified treatment of the
multiclass and binary classification problems. We have shown that the solution path of the binary
version of our proposal is equivalent to that by ROAD and DSDA. Moreover, LPD is identical
to the Dantzig selector formulation of our proposal for the binary case. In light of this evidence,
our proposal is regarded as the natural multiclass generalization of those proven successful binary
sparse LDA methods.


Appendices 

A.l Connections with Fisher’s discriminant analysis 

For simplicity, in this subsection we denote $\eta$ as the discriminant directions defined by Fisher's
discriminant analysis in (4), and $\theta$ as the discriminant directions defined by the Bayes rule. Our
method gives a sparse estimate of $\theta$. In this section, we discuss the connection between $\theta$ and
$\eta$, and hence the connection between our method and Fisher's discriminant analysis. We first
comment on the advantage of directly estimating $\theta$ rather than estimating $\eta$. Then we discuss how
to estimate $\eta$ once $\theta$ is available.

There are two advantages of estimating $\theta$ rather than $\eta$. Firstly, estimating $\theta$ allows for
simultaneous estimation of all the discriminant directions. Note that (4) requires that $\eta_k^T \Sigma \eta_l = 0$ for
any $l < k$. This requirement almost necessarily leads to a sequential optimization problem, which
is indeed the case for sparse optimal scoring and $\ell_1$ penalized Fisher's discriminant analysis. In
our proposal, the discriminant direction $\theta_k$ is determined by the covariance matrix and the mean
vectors within Class $k$, but is not related to $\theta_l$ for any $l \ne k$. Hence, our proposal can
simultaneously estimate all the directions by solving a convex problem. Secondly, it is easy to study the
theoretical properties if we focus on $\theta$. On the population level, $\theta$ can be written out in explicit
forms and hence it is easy to calculate the difference between $\theta$ and $\hat\theta$ in the theoretical studies.
Since $\eta$ does not have a closed-form solution even when we know all the parameters, it is relatively
harder to study its theoretical properties.

Moreover, if one is specifically interested in the discriminant directions $\eta$, it is very easy to
obtain a sparse estimate of them once we have a sparse estimate of $\theta$. For convenience, for any
positive integer $m$, denote $0_m$ as an $m$-dimensional vector with all entries being 0, $1_m$ as an
$m$-dimensional vector with all entries being 1, and $I_m$ as the $m \times m$ identity matrix. The following
lemma provides an approach to estimating $\eta$ once $\theta$ is available. The proof is relegated to Section
A.2.

Lemma 2. The discriminant directions rj contain all the right eigenvectors o/0on(5J correspond¬ 
ing to positive eigenvalues, where 6q = (Op, 0), 11 = li^- — and Sq = {pi—fi, ..., pK — p) 
with p, = Y.k=i ^kPk- 

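Therefore, once we have obtained a sparse estimate of $\theta$, we can estimate $\eta$ as follows. Without
loss of generality write $\hat\theta = (\hat\theta_{\hat{\mathcal{D}}}^T, 0)^T$, where $\hat{\mathcal{D}} = \{j : \hat\theta_{\cdot j} \ne 0\}$. Then $\hat\theta_0 = (0, \hat\theta)$. On the other
hand, set $\hat\delta_0 = (\hat\mu_1 - \hat{\bar\mu}, \ldots, \hat\mu_K - \hat{\bar\mu})$ where $\hat\mu_k$ are sample estimates and $\hat{\bar\mu} = \sum_{k=1}^K \hat\pi_k\hat\mu_k$. It
follows that $\hat\theta_0 \Pi \hat\delta_0^T = ((\hat\theta_{0,\hat{\mathcal{D}}} \Pi \hat\delta_0^T)^T, 0)^T$. Consequently, we can perform eigen-decomposition
on $\hat\theta_{0,\hat{\mathcal{D}}} \Pi \hat\delta_{0,\hat{\mathcal{D}}}^T$ to obtain $\hat\eta_{\hat{\mathcal{D}}}$. Because $\hat{\mathcal{D}}$ is a small subset of the original variables, this decomposition
will be computationally efficient. Then $\hat\eta$ would be $(\hat\eta_{\hat{\mathcal{D}}}^T, 0)^T$.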


A.2 Technical Proofs 


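Proof of Proposition 1. We first show the first identity, which links MSDA to ROAD.
For a vector $\theta \in \mathbb{R}^{p}$, define

$$L^{\mathrm{MSDA}}(\theta, \lambda) = \frac{1}{2}\theta^{T}\hat\Sigma\theta - (\hat\mu_2 - \hat\mu_1)^{T}\theta + \lambda\|\theta\|_1, \tag{22}$$

$$L^{\mathrm{ROAD}}(\theta, \lambda) = \frac{1}{2}\theta^{T}\hat\Sigma\theta + \lambda\|\theta\|_1. \tag{23}$$

Set $\tilde\theta = c_0(\lambda)^{-1}\hat\theta^{\mathrm{MSDA}}(\lambda)$. Since $\tilde\theta^{T}(\hat\mu_2 - \hat\mu_1) = 1$, it suffices to check that, for any $\tilde\theta'$ such
that $(\tilde\theta')^{T}(\hat\mu_2 - \hat\mu_1) = 1$, we have $L^{\mathrm{ROAD}}(\tilde\theta, \lambda/|c_0(\lambda)|) \le L^{\mathrm{ROAD}}(\tilde\theta', \lambda/|c_0(\lambda)|)$. Now for any such $\tilde\theta'$,

$$L^{\mathrm{MSDA}}(c_0(\lambda)\tilde\theta', \lambda) = c_0(\lambda)^{2} L^{\mathrm{ROAD}}\Big(\tilde\theta', \frac{\lambda}{|c_0(\lambda)|}\Big) - c_0(\lambda). \tag{24}$$

Similarly,

$$L^{\mathrm{MSDA}}(c_0(\lambda)\tilde\theta, \lambda) = c_0(\lambda)^{2} L^{\mathrm{ROAD}}\Big(\tilde\theta, \frac{\lambda}{|c_0(\lambda)|}\Big) - c_0(\lambda). \tag{25}$$

Since $L^{\mathrm{MSDA}}(c_0(\lambda)\tilde\theta, \lambda) \le L^{\mathrm{MSDA}}(c_0(\lambda)\tilde\theta', \lambda)$, we have the first identity.

On the other hand, by Theorem 1 in Mai & Zou (2013), we have

$$\hat\theta^{\mathrm{DSDA}}(\lambda) = c_1(\lambda)\,\hat\theta^{\mathrm{ROAD}}\Big(\frac{\lambda}{n|c_1(\lambda)|}\Big). \tag{26}$$

Therefore,

$$\hat\theta^{\mathrm{ROAD}}\Big(\frac{\lambda}{|c_0(\lambda)|}\Big) = \hat\theta^{\mathrm{ROAD}}\Big(\frac{n|c_1(\lambda)|\lambda/|c_0(\lambda)|}{n|c_1(\lambda)|}\Big) \tag{27}$$

$$= \Big(c_1\Big(\frac{n|c_1(\lambda)|\lambda}{|c_0(\lambda)|}\Big)\Big)^{-1}\hat\theta^{\mathrm{DSDA}}\Big(\frac{n|c_1(\lambda)|\lambda}{|c_0(\lambda)|}\Big) \tag{28}$$

$$= (c_1(a\lambda))^{-1}\hat\theta^{\mathrm{DSDA}}(a\lambda). \tag{29}$$

Combining (29) with the first identity yields the second identity.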


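□

Proof of Lemma 1. We start with simplifying the first part of our objective function,
$\frac{1}{2}\theta_k^{T}\hat\Sigma\theta_k - (\hat\mu_k - \hat\mu_1)^{T}\theta_k$.

First, note that

$$\frac{1}{2}\theta_k^{T}\hat\Sigma\theta_k = \frac{1}{2}\sum_{l,m=1}^{p}\theta_{kl}\theta_{km}\hat\sigma_{lm} \tag{30}$$

$$= \frac{1}{2}\theta_{kj}^{2}\hat\sigma_{jj} + \frac{1}{2}\sum_{l \ne j}\theta_{kl}\theta_{kj}\hat\sigma_{lj} + \frac{1}{2}\sum_{m \ne j}\theta_{kj}\theta_{km}\hat\sigma_{jm} \tag{31}$$

$$\quad + \frac{1}{2}\sum_{l \ne j,\, m \ne j}\theta_{kl}\theta_{km}\hat\sigma_{lm}. \tag{32}$$

Because $\hat\sigma_{ij} = \hat\sigma_{ji}$, we have $\sum_{l \ne j}\theta_{kl}\theta_{kj}\hat\sigma_{lj} = \sum_{m \ne j}\theta_{kj}\theta_{km}\hat\sigma_{jm}$. It follows that

$$\frac{1}{2}\theta_k^{T}\hat\Sigma\theta_k = \frac{1}{2}\theta_{kj}^{2}\hat\sigma_{jj} + \sum_{l \ne j}\theta_{kj}\theta_{kl}\hat\sigma_{lj} + \frac{1}{2}\sum_{l \ne j,\, m \ne j}\theta_{kl}\theta_{km}\hat\sigma_{lm}. \tag{33}$$

Then recall that $\hat\delta^{k} = \hat\mu_k - \hat\mu_1$. We have

$$(\hat\mu_k - \hat\mu_1)^{T}\theta_k = \sum_{l=1}^{p}\hat\delta^{k}_{l}\theta_{kl} = \hat\delta^{k}_{j}\theta_{kj} + \sum_{l \ne j}\hat\delta^{k}_{l}\theta_{kl}. \tag{34}$$

Combine (33) and (34) and we have

$$\frac{1}{2}\theta_k^{T}\hat\Sigma\theta_k - (\hat\mu_k - \hat\mu_1)^{T}\theta_k \tag{35}$$

$$= \frac{1}{2}\theta_{kj}^{2}\hat\sigma_{jj} + \sum_{l \ne j}\theta_{kj}\theta_{kl}\hat\sigma_{lj} + \frac{1}{2}\sum_{l \ne j,\, m \ne j}\theta_{kl}\theta_{km}\hat\sigma_{lm} - \hat\delta^{k}_{j}\theta_{kj} - \sum_{l \ne j}\hat\delta^{k}_{l}\theta_{kl} \tag{36}$$

$$= \frac{1}{2}\theta_{kj}^{2}\hat\sigma_{jj} + \Big(\sum_{l \ne j}\hat\sigma_{lj}\theta_{kl} - \hat\delta^{k}_{j}\Big)\theta_{kj} + \frac{1}{2}\sum_{l \ne j,\, m \ne j}\theta_{kl}\theta_{km}\hat\sigma_{lm} - \sum_{l \ne j}\hat\delta^{k}_{l}\theta_{kl}. \tag{37}$$

Note that the last two terms do not involve $\theta_{\cdot j}$. Therefore, given $\{\theta_{\cdot j'}, j' \ne j\}$, the solution
of $\theta_{\cdot j}$ is defined as

$$\arg\min_{\theta_{\cdot j}} \sum_{k=2}^{K}\Big\{\frac{1}{2}\theta_{kj}^{2}\hat\sigma_{jj} + \Big(\sum_{l \ne j}\hat\sigma_{lj}\theta_{kl} - \hat\delta^{k}_{j}\Big)\theta_{kj}\Big\} + \lambda\|\theta_{\cdot j}\|,$$

which is equivalent to the minimization problem stated in the lemma. The closed-form solution
then follows easily (Yuan & Lin 2006).

□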
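The blockwise problem at the end of this proof has a closed-form group soft-thresholding solution. The sketch below derives the update directly from that display, keeping the factor $\hat\sigma_{jj}$ explicit; it is an illustration under assumed names and array shapes (`update_row`, `objective`, variables in rows), not the authors' code.

```python
import numpy as np

def objective(theta, sigma_hat, delta_hat, lam):
    """Sum over k of 0.5*theta_k' S theta_k - delta_k' theta_k, plus the group penalty.

    theta, delta_hat : (p, K-1) arrays; column k holds theta_k and mu_k - mu_1.
    """
    quad = 0.5 * np.sum(theta * (sigma_hat @ theta))
    lin = np.sum(delta_hat * theta)
    pen = lam * np.linalg.norm(theta, axis=1).sum()  # sum_j ||theta_.j||
    return quad - lin + pen

def update_row(theta, j, sigma_hat, delta_hat, lam):
    """Minimise the objective over theta_.j (row j) with all other rows fixed."""
    s_jj = sigma_hat[j, j]
    # sum_{l != j} sigma_lj * theta_kl, for every k at once
    cross = sigma_hat[j] @ theta - s_jj * theta[j]
    theta_tilde = (delta_hat[j] - cross) / s_jj      # unpenalised minimiser
    norm = np.linalg.norm(theta_tilde)
    shrink = max(0.0, 1.0 - lam / (s_jj * norm)) if norm > 0 else 0.0
    out = theta.copy()
    out[j] = shrink * theta_tilde                    # group soft-thresholding
    return out
```

Cycling `update_row` over $j = 1, \ldots, p$ until the rows stop changing gives a plain blockwise coordinate descent for the penalized objective.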
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In what follows we use C to denote a generic constant for convenience. 

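Now we define an oracle "estimator" that relies on the knowledge of $\mathcal{D}$ for a specific tuning
parameter $\lambda$:

$$\hat\theta^{\mathrm{oracle}}_{\mathcal D}(\lambda) = \arg\min_{\theta_{2,\mathcal D}, \ldots, \theta_{K,\mathcal D}} \sum_{k=2}^{K}\Big\{\frac{1}{2}\theta_{k,\mathcal D}^{T}\hat\Sigma_{\mathcal D,\mathcal D}\theta_{k,\mathcal D} - (\hat\mu_{k,\mathcal D} - \hat\mu_{1,\mathcal D})^{T}\theta_{k,\mathcal D}\Big\} + \lambda\sum_{j \in \mathcal D}\|\theta_{\cdot j}\|. \tag{38}$$

The proof of Theorem 1 is based on a series of technical lemmas.


Lemma 3. Define $\hat\theta^{\mathrm{oracle}}_{\mathcal D}(\lambda)$ as in (38). Then $\hat\theta_k = (\hat\theta^{\mathrm{oracle}}_{k,\mathcal D}, 0)$, $k = 2, \ldots, K$, is the solution to the MSDA problem
if

$$\max_{j \in \mathcal D^{c}} \Big[\sum_{k=2}^{K}\big\{(\hat\Sigma_{\mathcal D^{c},\mathcal D}\hat\theta^{\mathrm{oracle}}_{k,\mathcal D})_j - (\hat\mu_{kj} - \hat\mu_{1j})\big\}^{2}\Big]^{1/2} < \lambda. \tag{39}$$

Proof of Lemma 3. The proof is completed by checking that $\hat\theta_k = (\hat\theta^{\mathrm{oracle}}_{k,\mathcal D}(\lambda), 0)$ satisfies the KKT
condition of the MSDA problem. □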


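Lemma 4. For each $k$, $\Sigma_{\mathcal D^{c},\mathcal D}\Sigma_{\mathcal D,\mathcal D}^{-1}(\mu_{k,\mathcal D} - \mu_{1,\mathcal D}) = \mu_{k,\mathcal D^{c}} - \mu_{1,\mathcal D^{c}}$.

Proof of Lemma 4. For each $k$, we have $\theta_{k,\mathcal D^{c}} = 0$. By definition, $\theta_{k,\mathcal D^{c}} = (\Sigma^{-1}(\mu_k - \mu_1))_{\mathcal D^{c}}$.
Then by block inversion, we have that

$$\theta_{k,\mathcal D^{c}} = -(\Sigma_{\mathcal D^{c},\mathcal D^{c}} - \Sigma_{\mathcal D^{c},\mathcal D}\Sigma_{\mathcal D,\mathcal D}^{-1}\Sigma_{\mathcal D,\mathcal D^{c}})^{-1}\big(\Sigma_{\mathcal D^{c},\mathcal D}\Sigma_{\mathcal D,\mathcal D}^{-1}(\mu_{k,\mathcal D} - \mu_{1,\mathcal D}) - (\mu_{k,\mathcal D^{c}} - \mu_{1,\mathcal D^{c}})\big),$$

and the conclusion follows.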


□ 
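Lemma 4 is easy to confirm numerically. The following is a small sanity check on simulated inputs, not part of the proof: the covariance, the support $\mathcal D$ (taken to be the first three variables) and the coefficients are arbitrary choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
p, d = 8, 3
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)         # a positive definite covariance
D, Dc = np.arange(d), np.arange(d, p)   # assumed support and its complement

theta_k = np.zeros(p)
theta_k[D] = rng.normal(size=d)         # theta_{k, D^c} = 0 by construction
mean_diff = Sigma @ theta_k             # mu_k - mu_1 = Sigma theta_k

# Left and right sides of the identity in Lemma 4.
lhs = Sigma[np.ix_(Dc, D)] @ np.linalg.solve(Sigma[np.ix_(D, D)], mean_diff[D])
rhs = mean_diff[Dc]
lemma4_gap = np.max(np.abs(lhs - rhs))  # zero up to floating-point error
```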


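Proposition 2. There exists a constant $\epsilon_0$ such that for any $\epsilon \le \epsilon_0$ we have

$$\mathrm{pr}\{|(\hat\mu_{kj} - \hat\mu_{1j}) - (\mu_{kj} - \mu_{1j})| \ge \epsilon\} \le C\exp\Big(-C\frac{n\epsilon^{2}}{K}\Big) + C\exp\Big(-\frac{Cn}{K^{2}}\Big), \quad k = 2, \ldots, K,\ j = 1, \ldots, p; \tag{40}$$

$$\mathrm{pr}(|\hat\sigma_{ij} - \sigma_{ij}| \ge \epsilon) \le 2\exp\Big(-C\frac{n\epsilon^{2}}{K}\Big) + 2\exp\Big(-\frac{Cn}{K^{2}}\Big), \quad i, j = 1, \ldots, p. \tag{41}$$

Proof of Proposition 2. We first show (40). Note that, by the Chernoff bound,

$$\mathrm{pr}(|\hat\mu_{kj} - \mu_{kj}| \ge \epsilon) \le E\{\mathrm{pr}(|\hat\mu_{kj} - \mu_{kj}| \ge \epsilon \mid Y)\} \le E\{C\exp(-Cn_k\epsilon^{2})\} \le 2\exp\Big(-C\frac{n\epsilon^{2}}{K}\Big) + 2\exp\Big(-\frac{Cn}{K^{2}}\Big).$$

A similar inequality holds for $\hat\mu_{1j}$, and (40) follows.

For (41), note that

$$\hat\sigma_{ij} = \frac{1}{n-K}\sum_{k=1}^{K}\sum_{Y^{m}=k}(X^{m}_{i} - \hat\mu_{ki})(X^{m}_{j} - \hat\mu_{kj})$$

$$= \frac{1}{n-K}\sum_{k=1}^{K}\sum_{Y^{m}=k}(X^{m}_{i} - \mu_{ki})(X^{m}_{j} - \mu_{kj}) - \frac{1}{n-K}\sum_{k=1}^{K}n_k(\hat\mu_{ki} - \mu_{ki})(\hat\mu_{kj} - \mu_{kj})$$

$$= \hat\sigma^{(0)}_{ij} - \frac{1}{n-K}\sum_{k=1}^{K}n_k(\hat\mu_{ki} - \mu_{ki})(\hat\mu_{kj} - \mu_{kj}).$$

Now by the Chernoff bound, $\mathrm{pr}(|\hat\sigma^{(0)}_{ij} - \sigma_{ij}| \ge \epsilon) \le C\exp(-Cn\epsilon^{2})$. Combining this fact with (40),
we have the desired result.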


□ 
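The proof of (41) rests on splitting the pooled sample covariance into a version centred at the true class means, written $\hat\sigma^{(0)}_{ij}$, minus a correction that depends only on the errors of the class means. A quick numerical check of that algebraic identity is sketched below; the simulated data, the equal class sizes and all variable names are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(2)
K, p, n_per_class = 3, 4, 50
mu = rng.normal(size=(K, p))                    # true class means (known here)
y = np.repeat(np.arange(K), n_per_class)
X = mu[y] + rng.normal(size=(y.size, p))
n = y.size

mu_hat = np.vstack([X[y == k].mean(axis=0) for k in range(K)])
resid_hat = X - mu_hat[y]
resid_true = X - mu[y]
sigma_hat = resid_hat.T @ resid_hat / (n - K)   # pooled sample covariance
sigma_0 = resid_true.T @ resid_true / (n - K)   # same sum, centred at true means
shift = mu_hat - mu
correction = sum(n_per_class * np.outer(shift[k], shift[k]) for k in range(K)) / (n - K)
decomposition_gap = np.max(np.abs(sigma_hat - (sigma_0 - correction)))
```

The gap is zero up to rounding, so a tail bound for $\hat\sigma_{ij}$ follows from one for $\hat\sigma^{(0)}_{ij}$ combined with the mean bounds.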


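Now we consider two events depending on a small $\epsilon > 0$:

$$A(\epsilon) = \{|\hat\sigma_{ij} - \sigma_{ij}| < \epsilon/d \text{ for any } i = 1, \ldots, p \text{ and } j \in \mathcal D\},$$
$$B(\epsilon) = \{|(\hat\mu_{kj} - \hat\mu_{1j}) - (\mu_{kj} - \mu_{1j})| < \epsilon \text{ for any } k \text{ and } j\}.$$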


By simple union bounds, we can derive Lemma 5 and Lemma 6.


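Lemma 5. There exists a constant $\epsilon_0$ such that for any $\epsilon \le \epsilon_0$ we have

1. $\mathrm{pr}(A(\epsilon)) \ge 1 - Cpd\exp\big(-Cn\frac{\epsilon^{2}}{Kd^{2}}\big) - CK\exp\big(-\frac{Cn}{K^{2}}\big)$;

2. $\mathrm{pr}(B(\epsilon)) \ge 1 - Cp(K-1)\exp\big(-Cn\frac{\epsilon^{2}}{K}\big) - CK\exp\big(-\frac{Cn}{K^{2}}\big)$;

3. $\mathrm{pr}(A(\epsilon) \cap B(\epsilon)) \ge 1 - \gamma(\epsilon)$, where

$$\gamma(\epsilon) = Cpd\exp\Big(-Cn\frac{\epsilon^{2}}{Kd^{2}}\Big) + Cp(K-1)\exp\Big(-Cn\frac{\epsilon^{2}}{K}\Big) + 2CK\exp\Big(-\frac{Cn}{K^{2}}\Big).$$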

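Lemma 6. Assume that both $A(\epsilon)$ and $B(\epsilon)$ have occurred. We have the following conclusions:

$\|\hat\Sigma_{\mathcal D,\mathcal D} - \Sigma_{\mathcal D,\mathcal D}\|_\infty < \epsilon$;

$\|\hat\Sigma_{\mathcal D^{c},\mathcal D} - \Sigma_{\mathcal D^{c},\mathcal D}\|_\infty < \epsilon$;

$\|(\hat\mu_k - \hat\mu_1) - (\mu_k - \mu_1)\|_\infty < \epsilon$;

$\|(\hat\mu_{k,\mathcal D} - \hat\mu_{1,\mathcal D}) - (\mu_{k,\mathcal D} - \mu_{1,\mathcal D})\|_1 < \epsilon$.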

Lemma 7. If both A(e) and Bie) have occurred for e < —,we have 

II^xic,xi(^xi,x>) ^ ~ ^llcxD < -zr ^—• 

I — pe 

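Proof of Lemma 7. Let $\eta_1 = \|\hat\Sigma_{\mathcal D,\mathcal D} - \Sigma_{\mathcal D,\mathcal D}\|_\infty$, $\eta_2 = \|\hat\Sigma_{\mathcal D^{c},\mathcal D} - \Sigma_{\mathcal D^{c},\mathcal D}\|_\infty$ and
$\eta_3 = \|(\hat\Sigma_{\mathcal D,\mathcal D})^{-1} - (\Sigma_{\mathcal D,\mathcal D})^{-1}\|_\infty$. First we have

$$\eta_3 \le \|(\hat\Sigma_{\mathcal D,\mathcal D})^{-1}\|_\infty \times \|\hat\Sigma_{\mathcal D,\mathcal D} - \Sigma_{\mathcal D,\mathcal D}\|_\infty \times \|(\Sigma_{\mathcal D,\mathcal D})^{-1}\|_\infty \le (\varphi + \eta_3)\varphi\eta_1.$$

On the other hand,

$$\|\hat\Sigma_{\mathcal D^{c},\mathcal D}(\hat\Sigma_{\mathcal D,\mathcal D})^{-1} - \Sigma_{\mathcal D^{c},\mathcal D}(\Sigma_{\mathcal D,\mathcal D})^{-1}\|_\infty \le \|\hat\Sigma_{\mathcal D^{c},\mathcal D} - \Sigma_{\mathcal D^{c},\mathcal D}\|_\infty \times \|(\hat\Sigma_{\mathcal D,\mathcal D})^{-1} - (\Sigma_{\mathcal D,\mathcal D})^{-1}\|_\infty$$

$$\quad + \|\hat\Sigma_{\mathcal D^{c},\mathcal D} - \Sigma_{\mathcal D^{c},\mathcal D}\|_\infty \times \|(\Sigma_{\mathcal D,\mathcal D})^{-1}\|_\infty$$

$$\quad + \|\Sigma_{\mathcal D^{c},\mathcal D}\|_\infty \times \|(\hat\Sigma_{\mathcal D,\mathcal D})^{-1} - (\Sigma_{\mathcal D,\mathcal D})^{-1}\|_\infty$$

$$\le \eta_2\eta_3 + \eta_2\varphi + \|\Sigma_{\mathcal D^{c},\mathcal D}\|_\infty\,\eta_3.$$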
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By iprii < 1 we have 773 < — ^prji) ^ and hence 




-1 


Sr>C,X)(^25,I> 


i-ii 


< 


1 — ipe 


□ 


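Lemma 8. Define

$$\theta^{0}_{k,\mathcal D} = \hat\Sigma_{\mathcal D,\mathcal D}^{-1}(\hat\mu_{k,\mathcal D} - \hat\mu_{1,\mathcal D}). \tag{42}$$

Then $\|\theta^{0}_{k,\mathcal D} - \theta_{k,\mathcal D}\|_1 \le \dfrac{\varphi\epsilon(1 + \varphi\Delta)}{1 - \varphi\epsilon}$.

Proof of Lemma 8. By definition, we have

$$\|\hat\Sigma_{\mathcal D,\mathcal D}^{-1}(\hat\mu_{k,\mathcal D} - \hat\mu_{1,\mathcal D}) - \Sigma_{\mathcal D,\mathcal D}^{-1}(\mu_{k,\mathcal D} - \mu_{1,\mathcal D})\|_1$$

$$\le \|\hat\Sigma_{\mathcal D,\mathcal D}^{-1} - \Sigma_{\mathcal D,\mathcal D}^{-1}\|_1\,\|(\hat\mu_{k,\mathcal D} - \hat\mu_{1,\mathcal D}) - (\mu_{k,\mathcal D} - \mu_{1,\mathcal D})\|_1$$

$$\quad + \|\Sigma_{\mathcal D,\mathcal D}^{-1}\|_1\,\|(\hat\mu_{k,\mathcal D} - \hat\mu_{1,\mathcal D}) - (\mu_{k,\mathcal D} - \mu_{1,\mathcal D})\|_1 + \|\hat\Sigma_{\mathcal D,\mathcal D}^{-1} - \Sigma_{\mathcal D,\mathcal D}^{-1}\|_1\,\|\mu_{k,\mathcal D} - \mu_{1,\mathcal D}\|_1$$

$$\le \frac{\varphi\epsilon(1 + \varphi\Delta)}{1 - \varphi\epsilon}.$$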


□ 
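The last inequality in the proof of Lemma 8 compresses a short calculation. A sketch of that arithmetic follows, under assumptions read off the surrounding lemmas rather than stated at this point: $\varphi$ bounds $\|\Sigma_{\mathcal D,\mathcal D}^{-1}\|_1$, $\Delta$ bounds $\|\mu_{k,\mathcal D} - \mu_{1,\mathcal D}\|_1$, the mean-difference error is below $\epsilon$ in $\ell_1$ norm (Lemma 6), and $\eta_3 = \|\hat\Sigma_{\mathcal D,\mathcal D}^{-1} - \Sigma_{\mathcal D,\mathcal D}^{-1}\|_1 \le \varphi^{2}\epsilon/(1 - \varphi\epsilon)$, which follows from $\eta_3 \le (\varphi + \eta_3)\varphi\eta_1$ and $\eta_1 < \epsilon$ in the proof of Lemma 7 (the matrices are symmetric, so their $\ell_1$ and $\ell_\infty$ operator norms coincide).

```latex
\begin{align*}
\|\theta^{0}_{k,\mathcal D} - \theta_{k,\mathcal D}\|_1
  &\le \eta_3\,\epsilon + \varphi\,\epsilon + \eta_3\,\Delta \\
  &\le \frac{\varphi^{2}\epsilon^{2}}{1 - \varphi\epsilon} + \varphi\epsilon
       + \frac{\varphi^{2}\epsilon\Delta}{1 - \varphi\epsilon} \\
  &= \frac{\varphi^{2}\epsilon^{2} + \varphi\epsilon(1 - \varphi\epsilon) + \varphi^{2}\epsilon\Delta}{1 - \varphi\epsilon}
   = \frac{\varphi\epsilon(1 + \varphi\Delta)}{1 - \varphi\epsilon}.
\end{align*}
```

The $\varphi^{2}\epsilon^{2}$ terms cancel in the numerator, which is why the final bound is linear in $\epsilon$ apart from the factor $(1 - \varphi\epsilon)^{-1}$.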


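Lemma 9. If $A(\epsilon)$ and $B(\epsilon)$ have occurred for $\epsilon < \min\{\frac{1}{2\varphi}, \frac{\lambda}{1 + \varphi\Delta}\}$, then for all $k$,

$$\|\hat\theta^{\mathrm{oracle}}_{k,\mathcal D} - \theta_{k,\mathcal D}\|_\infty \le 4\varphi\lambda.$$

Proof of Lemma 9. Observe that $\hat\theta^{\mathrm{oracle}}_{k,\mathcal D} = \hat\Sigma_{\mathcal D,\mathcal D}^{-1}(\hat\mu_{k,\mathcal D} - \hat\mu_{1,\mathcal D}) - \lambda\hat\Sigma_{\mathcal D,\mathcal D}^{-1}\hat t_{k,\mathcal D}$. Therefore,

$$\|\hat\theta^{\mathrm{oracle}}_{k,\mathcal D} - \theta_{k,\mathcal D}\|_\infty \le \|\theta^{0}_{k,\mathcal D} - \theta_{k,\mathcal D}\|_\infty + \lambda\|\hat\Sigma_{\mathcal D,\mathcal D}^{-1} - \Sigma_{\mathcal D,\mathcal D}^{-1}\|_1\,\|\hat t_{k,\mathcal D}\|_\infty + \lambda\|\Sigma_{\mathcal D,\mathcal D}^{-1}\|_1\,\|\hat t_{k,\mathcal D}\|_\infty,$$

where $\theta^{0}_{k,\mathcal D}$ is defined as in (42). Now $\|\hat t_{k,\mathcal D}\|_\infty \le 1$ and we have

$$\|\hat\theta^{\mathrm{oracle}}_{k,\mathcal D} - \theta_{k,\mathcal D}\|_\infty \le \frac{\varphi\epsilon(1 + \varphi\Delta) + \lambda\varphi}{1 - \varphi\epsilon} < 4\varphi\lambda.$$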
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□ 


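Lemma 10. For a set of real numbers $\{a_1, \ldots, a_N\}$, if $\sum_{i=1}^{N} a_i^{2} \le \kappa^{2} < 1$, then $\sum_{i=1}^{N}(a_i + b)^{2} < 1$
as long as $b < \dfrac{1 - \kappa}{\sqrt{N}}$.

Proof. By the Cauchy-Schwarz inequality, we have that

$$\sum_{i=1}^{N}(a_i + b)^{2} = \sum_{i=1}^{N} a_i^{2} + 2\sum_{i=1}^{N} a_i b + Nb^{2} \tag{43}$$

$$\le \sum_{i=1}^{N} a_i^{2} + 2\sqrt{\Big(\sum_{i=1}^{N} a_i^{2}\Big) \cdot Nb^{2}} + Nb^{2} \tag{44}$$

$$\le \kappa^{2} + 2\kappa\sqrt{Nb^{2}} + Nb^{2}, \tag{45}$$

which is less than 1 when $b < \dfrac{1 - \kappa}{\sqrt{N}}$.

□

We are ready to complete the proof of Theorem 1.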


Proof of TheoremU} We first consider the first conclusion. For any A < ande < min{^, —— 

consider the event $A(\epsilon) \cap B(\epsilon)$. By Lemmas 3, 5 and 9 it suffices to verify (39).

For any $j \in \mathcal D^{c}$, by Lemma 4 we have




^ \^VC.v0k^V “ i^vc,v^k,v)j\ + \{Fkj — Flj) — ibkj — Fij)\ 

^ ,v^k,v ^)i “ {'^v(^,v^k,v)j \ + e 

^ \{'^v‘^,v^^k^)j ~ {'^v‘^,v^k,v)j \ + e + X\{'Ej)C j\ 
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< \\i'^vc,v)j ~ i'^VC,v)j\\l\\^k,V ~~ dk,v\\oo + ll^fc.ullooll (Sx)C X))j — (Sx)C x>)j||l 

+ ll(^DC^X,)j||l||0°^^ — 0fc,l)||oo + e 

< Ce. (46) 

I — (S 25 C 

+11 ^XiC,X>^D|x)||cx)||tfc,Xi — tfc,X>||oo + 11^1)0^X1^25^25 — Sx>C,X>5^X)!r>II 00 1 (tfc,X>)j | 


Therefore, 


\ikj tkj\ 


< 

< 


l ^fcj||6>.j|| - 0kj\\0.j\\ I 

ll^.llllill 

l^kj - ^fcilll^.jll + ^maxll^.j - O.j 

11^.2 II 11 ^. II 

0 minV(i^-l) 


A|( ^V^,v'^v[v^k,v)j I 

< A|(Ex>c x>Sx,]x)tfc,D)il + A(-— -h r]*- - 

l-9?e e^iWK-1 

< A|(Sx>c^x>5^x>!D^fc,i’)il + 


Under condition (C0), it follows from (46) and (48) that


\{'^V‘^,V^k,V ^)i “ iP'kj — Alj)! < ,v'^v]v^k,v)j \ + CA^ 


(47) 

(48) 


(49) 


Combining condition (C0) with Lemma 10, we have that there exists a generic constant $M > 0$
such that, when $\lambda < M(1 - \kappa)$, (39) is true. Therefore, the first conclusion is true.
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Under conditions (C2)-(C4), the second conclusion directly follows from the first conclusion. 


□ 


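Proof of Theorem 2. We first show the first conclusion. Define $\hat Y(\theta_2, \ldots, \theta_K)$ as the prediction by
the Bayes rule and $\hat Y(\hat\theta_2, \ldots, \hat\theta_K)$ as the prediction by the estimated classification
rule. Also define $l_k = (X - \mu_k/2)^{T}\theta_k + \log(\pi_k)$ and $\hat l_k = (X - \hat\mu_k/2)^{T}\hat\theta_k + \log(\hat\pi_k)$.

Define $C(\epsilon) = \{|\hat\pi_k - \pi_k| \le \min\{\min_k \pi_k/2, \epsilon\}\}$. By the Bernstein inequality we have that
$\Pr(C(\epsilon)^{c}) \le C\exp(-Cn)$.

Assume that the event $A(\epsilon) \cap B(\epsilon) \cap C(\epsilon)$ has happened. By Lemma 5, we have

$$\Pr(A(\epsilon) \cap B(\epsilon) \cap C(\epsilon)) \ge 1 - Cpd\exp\Big(-Cn\frac{\epsilon^{2}}{Kd^{2}}\Big) - CK\exp\Big(-C\frac{n}{K^{2}}\Big) - Cp(K-1)\exp\Big(-Cn\frac{\epsilon^{2}}{K}\Big). \tag{50}$$

For any $\epsilon_0 > 0$,

$$R_n - R \le \Pr(\hat Y(\theta_2, \ldots, \theta_K) \ne \hat Y(\hat\theta_2, \ldots, \hat\theta_K))$$

$$\le 1 - \Pr(|\hat l_k - l_k| < \epsilon_0/2,\ |l_k - l_{k'}| > \epsilon_0, \text{ for any } k, k')$$

$$\le \Pr(|\hat l_k - l_k| \ge \epsilon_0/2 \text{ for some } k) + \Pr(|l_k - l_{k'}| \le \epsilon_0 \text{ for some } k, k').$$

Now, for $X$ in each class, $l_k - l_{k'}$ is normal with variance $(\theta_k - \theta_{k'})^{T}\Sigma(\theta_k - \theta_{k'})$. Therefore,

$$\Pr(|l_k - l_{k'}| \le \epsilon_0 \text{ for some } k, k') \le \sum_{k''}\Pr(|l_k - l_{k'}| \le \epsilon_0 \text{ for some } k, k' \mid Y = k'')$$

$$\le \sum_{k,k'}\frac{C\epsilon_0}{\{(\theta_k - \theta_{k'})^{T}\Sigma(\theta_k - \theta_{k'})\}^{1/2}}$$

$$\le CK^{2}\epsilon_0.$$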


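On the other hand, conditional on training data, $l_k - \hat l_k$ is normal with mean
$u(k, k') = \mu_{k'}^{T}(\theta_k - \hat\theta_k) + \frac{1}{2}(\hat\mu_k^{T}\hat\theta_k - \mu_k^{T}\theta_k) + \log\pi_k - \log\hat\pi_k$ and variance
$(\theta_k - \hat\theta_k)^{T}\Sigma(\theta_k - \hat\theta_k)$ within class $k'$. By Markov's inequality, we have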


Pr(|4-4| > eo/2forsome/c) = Pr(|4 - 4| > eo/2 \ Y = k') 


k' 


^ ma.yik{Ok-OkY^{Ok-Ok)^ 
^ ieo-uiYk')Y 


Moreover, under the event A{e) n B{e) fl C{e) 


max{6k - 6kY'E{6k - Ok) < CX 


\u{k,k')\ < \fJ'k'{Ok - 6k)\ +-\^YiYk - ^J>k)\ 


2 ' 

< CiA 


- Afc)^^fc| + I logTTfc - logTTfcl 


Henee, piekeo = M 2 A^/^ suehthateo > C'iA/2, for (Pi in (|5TI). ThenPr(|4— 4 I > eo/2 for some k) < 
It follows that \Rn — R\ < MiA^/^ for some positive eonstant Mi. 

Under Conditions (C2)-(C4), the second conclusion is a direct consequence of the first
conclusion. □


We need the result in the following proposition to show Lemma 2. A slightly different version
of the proposition has been presented in Fukunaga (1990) (Pages 446-450), but we include the
proof here for completeness.


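Proposition 3. The solution to the optimization problem defining the discriminant directions consists
of all the right eigenvectors of $\Sigma^{-1}\delta_0\delta_0^{T}$ corresponding to positive eigenvalues.

Proof. For any $\eta_k$, set $u_k = \Sigma^{1/2}\eta_k$. It follows that solving this problem is equivalent to finding

$$(u_1^{*}, \ldots, u_{K-1}^{*}) = \arg\max_{u_k}\ u_k^{T}\Sigma^{-1/2}\delta_0\delta_0^{T}\Sigma^{-1/2}u_k \quad \text{s.t. } u_k^{T}u_k = 1 \text{ and } u_k^{T}u_l = 0 \text{ for any } l < k, \tag{51}$$

and then setting $\eta_k = \Sigma^{-1/2}u_k^{*}$. It is easy to see that $u_1^{*}, \ldots, u_{K-1}^{*}$ are the eigenvectors
corresponding to positive eigenvalues of $\Sigma^{-1/2}\delta_0\delta_0^{T}\Sigma^{-1/2}$. By Proposition 4, let $A = \Sigma^{-1/2}\delta_0\delta_0^{T}$ and
$B = \Sigma^{-1/2}$ and we have that $\eta$ consists of all the eigenvectors of $\Sigma^{-1}\delta_0\delta_0^{T}$ corresponding to
positive eigenvalues. □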


Proposition 4. (Mardia et al. (1979), Page 468, Theorem A.6.2) For two matrices $A$ and $B$, if $x$ is
a non-trivial eigenvector of $AB$ for a nonzero eigenvalue, then $y = Bx$ is a non-trivial eigenvector
of $BA$.
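Proposition 4 can be illustrated numerically. The check below uses arbitrary random rectangular matrices (an assumption of this example, chosen so that $AB$ and $BA$ have different sizes) and verifies that $y = Bx$ is an eigenvector of $BA$ with the same eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 6))
B = rng.normal(size=(6, 4))

eigvals, eigvecs = np.linalg.eig(A @ B)  # AB is 4 x 4
i = int(np.argmax(np.abs(eigvals)))      # pick a clearly nonzero eigenvalue
lam, x = eigvals[i], eigvecs[:, i]
y = B @ x                                # candidate eigenvector of BA (6 x 6)
residual = np.linalg.norm((B @ A) @ y - lam * y)
```

The residual vanishes up to rounding because $BA(Bx) = B(ABx) = \lambda Bx$, and $Bx \ne 0$ since $ABx = \lambda x \ne 0$.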


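Proof of Lemma 2. Set $\tilde\delta = (0_p, \delta)$ and $\delta_0 = (\mu_1 - \bar\mu, \ldots, \mu_K - \bar\mu)$. Note that
$\tilde\delta 1_K = \sum_{k=2}^{K}\mu_k - (K - 1)\mu_1 = K(\bar\mu - \mu_1)$. Therefore, $\delta_0 = \tilde\delta - \frac{1}{K}\tilde\delta 1_K 1_K^{T} = \tilde\delta(I_K - \frac{1}{K}1_K 1_K^{T}) = \tilde\delta\Pi$.

Then, since $\theta_0 = \Sigma^{-1}\tilde\delta$, we have $\theta_0\Pi = \Sigma^{-1}\delta_0$ and $\theta_0\Pi\delta_0^{T} = \Sigma^{-1}\delta_0\delta_0^{T}$. By Proposition 3
we have the desired conclusion. □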
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